This study analyzes data on incarceration in the United States from 1970 to 2018. The variable of interest is ‘total_jail_pop’, the total number of people held in jail. The ‘year’ and ‘state’ variables are included so that comparisons can be made by year and by state. Jail population rates by race (‘black_jail_pop_rate’, ‘white_jail_pop_rate’, ‘latinx_jail_pop_rate’, and ‘native_jail_pop_rate’) are included so that incarceration can be compared across races. Finally, the ‘land_area’ and ‘urbanicity’ variables are included to relate ‘total_jail_pop’ to land area and to whether an area is urban, suburban, rural, or small/mid.
Table 1 summarizes the study’s numeric variables. The largest value of ‘total_jail_pop’ is 23467.19 and the smallest is 0. The mean across all states and years is 161.121, with a standard deviation of 615.541, nearly four times the mean, which indicates that the observations are widely dispersed.
Comparing mean jail population rates across races, African Americans have the highest mean, 4393.668, meaning that on average they have the largest share of their population in jail of the four groups. Native Americans are next (1195.047), followed by Latinx Americans (1166.668) and finally white Americans (318.500), who have the lowest rate.
According to the skewness and kurtosis statistics, all variables are strongly right-skewed and heavy-tailed, which suggests that a log transformation is needed to view the data properly. A log transformation replaces each value x of a variable with log(x). This analysis uses the natural log, denoted ln. When continuous data do not follow a bell curve (normal distribution), log-transforming them can bring them closer to normal, which improves the reliability of statistical analysis. In other words, the log transformation reduces or removes the skewness of the original data.
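The ranking above can be reproduced from the means and medians reported in Table 1. The following is a minimal sketch in base R; the data frame `rates` and the other object names are introduced here for illustration only and are not part of the original analysis.

```r
# Means and medians copied from Table 1 (jail population rates by race)
rates <- data.frame(
  race   = c("Black", "White", "Latinx", "Native"),
  mean   = c(4393.668, 318.500, 1166.668, 1195.047),
  median = c(1138.15, 209.88, 369.03, 0.00)
)

# Rank the groups from highest to lowest, by mean and by median
by_mean   <- rates$race[order(rates$mean, decreasing = TRUE)]
by_median <- rates$race[order(rates$median, decreasing = TRUE)]

# Ratio of the Black mean rate to the white mean rate
ratio_black_white <- rates$mean[rates$race == "Black"] /
  rates$mean[rates$race == "White"]

print(by_mean)
print(by_median)
print(ratio_black_white)
```

Ranking by median gives a different order than ranking by mean. The median of ‘native_jail_pop_rate’ is 0, so at least half of its observations are zero and its mean is driven by a small number of large values.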
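As a minimal sketch of the effect described above, the base R snippet below compares the skewness of a small right-skewed vector before and after a log transformation. The values in `x` are hypothetical and not taken from the dataset, and the `skewness()` helper is defined here for illustration. Because several variables in Table 1 have a minimum of 0 and ln(0) is undefined (R returns `-Inf`), the sketch also shows `log1p()`, which computes ln(1 + x); using it is an assumption of this example, not something the original analysis does.

```r
# Hypothetical right-skewed values (not taken from the dataset)
x <- c(0, 5, 12, 32, 150, 900, 23000)

# Sample skewness as the third standardized moment (base R only)
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^(3 / 2)
}

ln_x   <- log(x)    # natural log; log(0) evaluates to -Inf
ln1p_x <- log1p(x)  # ln(1 + x), which is defined at zero

skew_raw <- skewness(x)
skew_log <- skewness(ln1p_x)

print(c(raw = skew_raw, log1p = skew_log))
```

The transformed values are noticeably less skewed than the raw ones. When zeros are kept with plain `log()`, the resulting `-Inf` values have to be dropped or handled before plotting or summarizing.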
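# Load packages (tidyverse provides readr, dplyr and ggplot2;
# basicStats() comes from fBasics)
library(tidyverse)
library(fBasics)
# Read data
inc <- read_csv("incarceration_trends.csv")
# Select variables of interest
my_dat <- inc %>%
  dplyr::select(state, year, total_jail_pop, land_area, black_jail_pop_rate,
                white_jail_pop_rate, latinx_jail_pop_rate, native_jail_pop_rate,
                urbanicity)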
## Summary statistics
summ <- basicStats(my_dat[,-c(1,2,9)])[c(3,4,7,8,13:16),]
t_summ <- as.data.frame(t(round(summ,3)))
names(t_summ) <- c("Min", "Max", "Mean", "Med", "Var", "St.dev", "Skew", "Kurt")
# Create table
knitr::kable(t_summ, caption = "Summary statistics of selected variables")
| | Min | Max | Mean | Med | Var | St.dev | Skew | Kurt |
|---|---|---|---|---|---|---|---|---|
| total_jail_pop | 0.00 | 23467.19 | 161.121 | 32.03 | 378890.4 | 615.541 | 15.665 | 377.657 |
| land_area | 2.05 | 145572.46 | 1125.570 | 616.81 | 13073141.6 | 3615.680 | 26.500 | 931.934 |
| black_jail_pop_rate | 0.00 | 2145000.00 | 4393.668 | 1138.15 | 523637334.7 | 22883.123 | 31.818 | 1983.550 |
| white_jail_pop_rate | 0.00 | 73887.49 | 318.500 | 209.88 | 828584.7 | 910.266 | 36.845 | 1974.251 |
| latinx_jail_pop_rate | 0.00 | 351162.79 | 1166.668 | 369.03 | 26871953.8 | 5183.817 | 21.304 | 752.294 |
| native_jail_pop_rate | 0.00 | 335500.00 | 1195.047 | 0.00 | 56915816.9 | 7544.257 | 19.707 | 549.899 |
Table 2 presents the 10 years with the highest average total jail population. The year 2008 had the highest average, 263.3505.
# Generate dataframe for average population by year. Select top 10 years
dat1 <- my_dat %>%
  group_by(year) %>%
  summarise(Avg.Pop = mean(total_jail_pop, na.rm = TRUE)) %>%
  arrange(desc(Avg.Pop)) %>%
  head(10)
# Create table
knitr::kable(dat1,
             caption = "Top 10 years with highest average total jail population")
| year | Avg.Pop |
|---|---|
| 2008 | 263.3505 |
| 2009 | 261.5556 |
| 2007 | 260.5793 |
| 2006 | 258.0700 |
| 2014 | 257.7243 |
| 2010 | 256.7282 |
| 2013 | 256.4916 |
| 2012 | 254.8216 |
| 2011 | 253.8052 |
| 2017 | 250.0884 |
Table 3 presents the 10 states with the highest average total jail population. The District of Columbia (DC) had the highest average, 2315.2551, followed by California with 1066.3031.
# Generate dataframe for average population by state. Select top 10 states
dat2 <- my_dat %>%
  group_by(state) %>%
  summarise(Avg.Pop = mean(total_jail_pop, na.rm = TRUE)) %>%
  arrange(desc(Avg.Pop)) %>%
  head(10)
# Generate table
knitr::kable(dat2,
             caption = "Top 10 states with highest average total jail population")
| state | Avg.Pop |
|---|---|
| DC | 2315.2551 |
| CA | 1066.3031 |
| AZ | 564.1401 |
| NJ | 562.0384 |
| MA | 549.9082 |
| FL | 546.2006 |
| NY | 422.1652 |
| MD | 353.9957 |
| PA | 336.8743 |
| LA | 282.4048 |
The graph below shows how the average jail population has changed over time, by race. Initially, the number of people incarcerated changed little. After 1984, however, the jail population increased significantly, peaking in 2008, and then declined steadily through 2018. The white jail population climbed substantially from 2008 to 2018, while the Black, Latinx, and other-race jail populations fell. White people make up the largest group in jail, followed by Black, Latinx, Other, and Native American people.
# Generate dataframe for average population by year and race
df1 <- inc %>%
  dplyr::select(white_jail_pop, black_jail_pop, native_jail_pop, latinx_jail_pop,
                other_race_jail_pop, year) %>%
  dplyr::rename(White = "white_jail_pop", Black = "black_jail_pop",
                Native = "native_jail_pop", Latin = "latinx_jail_pop",
                Other = "other_race_jail_pop") %>%
  # Reshape to long format: one row per year and race
  reshape2::melt(id = "year") %>%
  dplyr::group_by(year, variable) %>%
  summarise(Avg.Pop = mean(value, na.rm = TRUE))
# Make plot
ggplot(df1, aes(x = year, y = Avg.Pop, color = variable)) +
  geom_line() +
  labs(x = "Year", y = "Average population", color = "Race",
       title = "Evolution of average incarcerations over the years") +
  theme(plot.title = element_text(hjust = 0.5))
The box plot below shows how total jail population relates to urbanicity. Rural areas have the smallest total jail populations, whereas urban areas have the largest.
# Box plot (observations with total_jail_pop = 0 give log(0) = -Inf
# and are dropped by ggplot2 with a warning)
ggplot(my_dat, aes(x = urbanicity, y = log(total_jail_pop),
                   fill = urbanicity)) +
  geom_boxplot() +
  labs(x = "Region", y = "Log Total population",
       title = "Relationship between region and total jail population") +
  theme(plot.title = element_text(hjust = 0.5))
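The ordering read off a box plot of this kind can also be checked numerically by comparing group medians. The sketch below uses a small hypothetical data frame `toy`; its values and its urbanicity labels are assumptions made for illustration, not rows from `incarceration_trends.csv`. It uses `log1p()` so that zero counts remain finite.

```r
# Hypothetical toy data (values and category labels are illustrative only)
toy <- data.frame(
  urbanicity = c("rural", "rural", "small/mid", "small/mid",
                 "suburban", "suburban", "urban", "urban"),
  total_jail_pop = c(0, 20, 60, 90, 150, 210, 900, 2400)
)

# Median of ln(1 + total_jail_pop) within each urbanicity category
med <- tapply(log1p(toy$total_jail_pop), toy$urbanicity, median)

# Categories ordered from smallest to largest median
ordered_groups <- names(sort(med))
print(ordered_groups)
```

Applying the same `tapply()` call to `my_dat` would give the medians behind the boxes, although categories with missing values may need `na.rm = TRUE` passed through to `median`.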
Finally, the shape file of the United States is retrieved from https://urbaninstitute.github.io/urbnmapr/ and loaded into R to create a geographical representation of the variable of interest. The average total jail population per state is then calculated over all years, and the result is joined with the shape file data to create a dataframe for mapping. The map below depicts the distribution of the average total jail population across the states. Consistent with Table 3, California has the highest average total jail population among the 50 states, behind only the District of Columbia.
# Load mapping packages
library(urbnmapr)
library(sf)
library(plotly)
# Average total jail population per state, keyed for the join
df2 <- my_dat %>%
  group_by(state) %>%
  summarise(Avg.Pop = mean(total_jail_pop, na.rm = TRUE)) %>%
  dplyr::rename(state_abbv = state)
# Join with state geometries and convert to an sf object
spatial_data <- left_join(df2,
                          get_urbn_map(map = "states", sf = TRUE),
                          by = "state_abbv") %>%
  st_as_sf()
# Build the map
m <- ggplot() +
  geom_sf(data = spatial_data, mapping = aes(fill = log(Avg.Pop))) +
  geom_sf_text(data = get_urbn_labels(map = "states", sf = TRUE),
               aes(label = state_abbv), size = 2) +
  scale_fill_distiller("Log Average Population", palette = "Spectral") +
  labs(title = "Average population by State", x = "Longitude", y = "Latitude") +
  theme(plot.title = element_text(hjust = 0.5))
# Render as an interactive plot
ggplotly(m)